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We present the current status of CMS data analysis architecture and describe work on future Grid-based 
distributed analysis prototypes. CMS has two main software frameworks related to data analysis: COBRA, 
the main framework, and IGUANA, the interactive visualisation framework. Software using these frameworks 
is used today in the world-wide production and analysis of CMS data. We describe their overall design and 
present examples of their current use with emphasis on interactive analysis. CMS is currently developing remote 
analysis prototypes, including one based on Clarens, a Grid-enabled client-server tool. Use of the prototypes by 
CMS physicists will guide us in forming a Grid-enriched analysis strategy. The status of this work is presented, 
as is an outline of how we plan to leverage the power of our existing frameworks in the migration of CMS 
software to the Grid. 
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1. Introduction 



2. CMS Analysis Environment 
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Requirements on CMS software and computing re- 
sources pj will far exceed those of any existing high en- 
ergy physics experiment, not only because of the com- 
plexity of the detector and of the physics task but also 
for the size and the distributed nature of the collabo- 
ration (today encompassing 2000 physicists from 150 
institutions in more than 30 countries) and the long 
time scale (20 or more years). It was widely recog- 
nised from the outset of planning for the LHC Exper- 
iments in the mid-1990s, that the computing systems 
required to collect, analyse and store the physics data 
would need to be distributed and global in scope. In 
particular, studies of the computing industry and its 
expected development showed that utilising comput- 
ing resources external to CERN, at the collaborating 
institutes (as had been done on a limited scale for the 
LEP experiments) would continue to be an essential 
strategy, and that a global computing system archi- 
tecture would need to be developed. 

Therefore, since 1995 the CMS computing group 
has been engaged in an extensive R&D program to 
evaluate and prototype a computing architecture that 
will allow to perform the task of collecting, recon- 
structing, distributing and storing the physics data 
correctly and efficiently in the LHC demanding envi- 
ronment, and, at the same time, to present the user, 
either local or remote, with a simple logical view of all 
objects needed to perform physics analysis or detector 
studies. 

More recently, CMS undertook a major require- 
ments and consensus building effort to modernise this 
vision of a distributed computing model to a Grid- 
based computing infrastructure. Accordingly, the cur- 
rent vision sees CMS computing as an activity that is 
performed on the "CMS Data Grid System" whose 
properties have been described in considerable detail 
U • The CMS Data Grid System specifies a division 
of labour between the various Grid projects and the 
CMS core computing project. 



Figure [I] shows the current model of the CMS Anal- 
ysis Environment. 
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Figure 1: CMS Analysis Environment 

CMS goal is to deploy a Coherent Analysis Envi- 
ronment that goes beyond the interactive analysis tool 
that provides classical data analysis and presentation 
functionalities such as N-tuples, histograms, fitting, 
plotting. We aim to allow an easy and coherent ac- 
cess to the great range of other activities typical of the 
development of physics analysis and detector studies 
in HEP. Support will be provided for both batch and 
interactive work with interface ranging from "pointy- 
clicky" to Emacs-like power tool to scripting. The en- 
vironment will encompass configuration management 
tools, application frameworks, simulation, reconstruc- 
tion and analysis packages. 

Most of the activity will be seen as operations on 
data-stores: Replicating entire data-stores; Copying 
runs, events, event parts between stores. A "copy" 
operation will often encompass something more com- 
plicated such as filtering, reconstruction, analysis. 

A particular enphasys will be placed on browsing 
data-stores down to object detail level including sup- 
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port for 2D and 3D visualisation. 

A final aim of such a coherent environment will be 
the ability to move and share code across final analy- 
sis, reconstruction and triggers. 



3. CMS Software System 
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Figure 2: CMS Software System and its compoments: the 
arrow shows the layers that LCG will enventually cover. 

In order to achieve such a Coherent Analysis En- 
vironment the software system has to provide a Con- 
sistent User interface on one side and a coherent set 
of basic tools and mechanisms on the other. CMS 
software system, presented in figure [2 is based on a 
modular and layered architecture |3j centred around 
the COBRAfl and IGUANA0 frameworks. It relies 
on a high-quality service and utility toolkit provided 
by the LCG project 6] in the form of either certified 
and maintained external software components or as 
packages developed by the project itself. 

The framework defines the top level abstractions, 
their behaviour and collaboration patterns. It com- 
prises two components: a set of classes that capture 
CMS specific concepts like detector components and 
event features and a control policy that orchestrates 
the instances of those classes taking care of the flow of 
control, module scheduling, input/output, etc. This 
control policy is tailored to the task in hand and to 
the computing environment. 

The physics and utility modules are written by de- 
tector groups and physicists. The modules can be 
plugged into the application framework at run time, 
independently of the computing environment. One 
can easily choose between different versions of various 
modules. The physics modules do not communicate 
with each other directly but only through the data 
access protocols that are part of the framework itself. 

Both the application framework and the service 
and utility toolkit shield the physics software modules 
from the underlying technologies which will be used 
for the computer services. This will ensure a smooth 
transition to new technologies with changes localised 



in the framework and in specific components of the 
service toolkit. 



4. Distributed Analysis 
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Figure 3: Distributed Analysis Architecture. 

Physicists analysing data will access a great variety 
of sources and services ranging from simple histograms 
and n-tuples, to Relational Data-Bases. Analysis ac- 
tivity may include simple "tag" selections, more com- 
plex SQL-like queries up to running CMS simula- 
tion, calibration and/or reconstruction applications. 
A combination of generic grid tools and specialised 
CMS tools will be required to optimise the use of com- 
puting resources. Figure |2l shows the current view of 
CMS architecture for Distributed Analysis. Physicists 
will use a single access point that will allow them to 
gain access to the required resources and services de- 
pending both on the task in hand and the nature of 
the data. An Analysis Server, such as the Clarens Re- 
mote Dataserver Q described in next section, will act 
as mediator between the user, and his front-end ap- 
plication, and the back-end services. These three-tier 
architecture realises a clean separation between the 
user environment and the service environment. This 
will enable the user to maintain a consistent person- 
alised environment while the various services may be 
configured and deployed according to policies specific 
to each provider site. 



4.1. The Clarens Remote Analysis Server 

The Clarens Remote Dataserver Q project aims to 
build a wide-area network client /server system for re- 
mote access to a variety of data and analysis services. 

The first service envisaged (and partly imple- 
mented) is analysis of events from Spring 2002 pro- 
duction currently stored in centralised Objectivity 
databases, but in future may be stored in a combi- 
nation of relational databases and/or files. Other ser- 
vices include remote access to Globus functionality for 
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non-Globus clients, including file transfer, replica cat- 
alog access and job scheduling. 

The concept of a "server" is currently the tradi- 
tional interpretation of the word, but might in future 
also refer to the multi-tiered data-GRID for data ac- 
cess and analysis. 

Communication between the client and server is 
conducted via the lightweight XML-RPC remote 
procedure call mechanism. This was chosen both for 
its simplicity, good standardisation, and wide support 
by almost all programming languages. Work is also 
underway to enable communication using the widely 
supported SOAP protocol. 

The server is implemented using a modular com- 
bination of Python and C++ that executes directly 
in the Apache ^(J we b server's address space. This 
increased performance roughly by a factor of five 
w.r.t. a cgi-bin implementation. Clients connect- 
ing to the server are authenticated using Grid certifi- 
cates, while the server can likewise be authenticated 
by clients. Client session data is tracked using the 
high-performance embedded Berkeley database. Ses- 
sions can optionally be conducted over strongly en- 
crypted SSL/TLS connections. Authentication infor- 
mation is always strongly encrypted, irrespective of 
whether the session is encrypted. 

The modular nature of the server allows function- 
ality to be added to a running server without tak- 
ing it off-line by way of drop-in components writ- 
ten in Python of a combination of C++ and Python. 
An example module that allows browsing and down- 
loading of histograms and Tag objects in an Objectiv- 
ity database federation has been implemented in this 
way. 

The multi-process model of the underlying server 
(Apache) is uniquely suited to handling large vol- 
umes of clients as well as long-running client requests. 
These server processes are protected from other mali- 
cious or faulty requests made by other clients as each 
process runs in its own address space. 

There are currently three clients that are evolving 
along with the server: 

• Python command line 

• C-l — h command line client as an extension of the 
ROOT analysis environment 

• Python GUI client in the SciGraphica analysis 
environment 

Support for the Java Analysis Studio client has been 
dropped in favour of the ROOT client due to the lat- 
ter 's greater popularity in the CMS community. A 
web-based front-end that supports the new authenti- 
cation mechanism will be deployed in the near future 
to replace the current non-authenticated version. 

The command-line Python client may also be used 
directly in any Python-based analysis environment 
such as the IGUANA scripting service. 



5. Data Challenges 

In order to test its Software system and its comput- 
ing model CMS is engaged in an aggressive program 
of "data challenges" of increasing complexity: 

• Data Challenge '02 

Focus on High-Level Trigger studies 

• Data Challenge '04 

Focus on "real-time" mission critical tasks 

• Data Challenge '06 

Focus on distributed physics analysis 

Even if each data challenge is focus on a given aspect, 
all encompass the whole data analysis process: Simu- 
lation, reconstruction and statistical analysis running 
either as organised production, end-user batch job or 
interactive work. 

5.1. Spring 2002 Production 

In Spring 2002 the CMS production team completed 
a production of Monte-Carlo data f° r the DAQ 
TDR [H. 

Almost 6 million events were simulated 
with the CMS GEANT3-based detector simulation 
program CMSIM^| and then digitised and recon- 
structed under a variety of conditions with the CMS 
Object Oriented reconstruction and analysis program 
ORCA0 based on the COBRA framework. About 
20 regional centres participated in this work running 
about 100 000 jobs for a total of 45 years CPU (wall- 
clock) distributed over about 1000 cpus. 20TB of data 
were produced. Of these more than 10TB travelled 
on the Wide area network. More than 100 physics 
were involved in the final analysis. PAW, and in a 
later stage ROOT, were the preferred analysis tools. 
Less specialised tools, such as Mathematica and Excel, 
were also used for some analyses. IGUANA was used 
for interactive graphical inspection of detector compo- 
nents and of single events. The result of this work 12J 
was the successful validation of CMS High-Level Trig- 
ger algorithms including rejection factors, computing 
performance and functionality of the reconstruction- 
framework. 

5.2. Data Challenge in 2004 (DC04) 

The previous data challenge has concentrated on 
the simulation and digitisation phase and the prepara- 
tion of datasets for asynchronous analysis. With this 
challenge CMS intends to perform a large-scale test of 
the computing and analysis models themselves. Thus 
we anticipate a pre-challenge period comprising the 
preparation of the data samples, while the challenge 
itself consists of the reconstruction and selection of the 
data at the TO (Tier-0 computing centre at CERN), 
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with distribution to the distributed T1/T2 sites and 
synchronous analysis. 

DC04 15] will run in the LCG production 
grid. We anticipate that the TO part of the chal- 
lenge is entirely CERN based, but the subsequent data 
replication and analysis run in a common prototype 
grid environment. All T0-T2 resources of the Data 
Challenge should be operating in a common GRID 
prototype [Tij. 

In this data challenge, to be completed in April 
2004, CMS will reconstruct 50 million events in real 
time coping with a data rate equivalent to an event 
data acquisition running at 25 Hz for a luminosity 
2 x 10 33 cms~ 2 s~ 1 for one month. Besides this pure 
computation goal CMS will use this opportunity to 
test the software system, the event model, define and 
validate datasets for analysis, identify reconstruction 
and analysis objects each physics group would like to 
have for the full analysis, develop selection algorithms 
necessary to obtain the required sample, and prepare 
for "mission critical" analysis, calibration and align- 
ment. 



6. Summary 

CMS is actively pursuing a development and test ac- 
tivity to be able to deploy a fully distributed comput- 
ing architecture in time for the LHC start-up. A dis- 
tributed production and analysis of 6 million Monte- 
Carlo events have been successfully completed in 2002 
involving 20 regional centres. Next major data chal- 
lenge is scheduled for the year 2004 and comprises a 
complexity scale equal to about 5% of that foreseen 
for LHC high luminosity running. It will run in the 
LCG production grid and will be used to exercise a 
fully distributed physics analysis using tools such as 
the Clarens Remote Dataserver running in conjunc- 
tion with CMS specific applications based on the CO- 
BRA and IGUANA frameworks. 
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